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^^ Abstract 

^^ Obtaining a closed-form sampling distribution for the coalescent with recombination is 

^~^ a challenging problem. In the case of two loci, a new framework based on asymptotic 

series has recently been developed to derive closed-form results when the recombination 

rate is moderate to large. In this paper, an arbitrary number of loci is considered and 
combinatorial approaches are employed to find closed-form expressions for the first 
couple of terms in an asymptotic expansion of the multi-locus sampling distribution. 
(^ These expressions are universal in the sense that their functional form in terms of 

"tli the marginal one-locus distributions applies to all finite- and infinite- alleles models of 

a mutation. 
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^^ 1 Introduction 

l> ^^ 

■^ Coalescent processes, first introduced by Kingman p4l[15| about three decades ago, are 

1^ widely-used stochastic models in population genetics that describe the genealogical ancestry of 

f^ a sample of chromosomes randomly drawn from a population. For many applications, the key 

T— I quantity of interest is the probability of observing the sample under a given coalescent model 

^"^ of evolution. In the one-locus case with special models of mutation such as the infinite-alleles 

J> model or the finite-alleles parent-independent mutation model, exact sampling distributions 

; '"j have been known in closed-form for many years ||3]|22). In contrast, for models with two or 

rS more loci with finite recombination rates, finding an exact, closed-form sampling distribution 

^ has remained a challenging open problem. Therefore, most previous approaches have focused 

on Monte Carlo methods, including importance sampling |4,7 8 ,20| and Markov chain Monte 

Carlo p6][T9l[2T] . Such methods have led to useful tools for population genetics analysis, 

but they are in general computationally intensive and their accuracy is difficult to characterize 

theoretically. 

Recently, Jenkins and one of us | II 13) made progress on the long-standing problem of 



finding sampling formulas for population genetic models with recombination by proposing 
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a new approach based on asymptotic expansion. That work can be summarized as follows. 
Consider an exchangeable random mating model with two loci, denoted A and B. In the 
standard coalescent or diffusion limit, let 9a and 9b denote the respective population-scaled 
mutation rates at loci A and B, and let p denote the population-scaled recombination rate 
between the two loci. Given a sample configuration n (defined later in the text), assume that p 
is large and consider an asymptotic expansion of the sampling probability q{n \ 9a,9b, p) in 
inverse powers of p: 

q{n I 9a,9b,p) ^qo(n \ 9a,9b) + \ 5 1 , (1) 

p p^ 

where the coefficients go (''T' I (^A,()B),qi{n \ 0A,dB),q2{n \ 6'^ , 6'b ),..., are independent 
of p. The zeroth-order term go corresponds to the sampling probability in the p — 00 case 
(i.e., when the loci evolve independently), given simply by a product of marginal one-locus 
sampling probabilities fTl. For either the infinite-alleles or an arbitrary finite-alleles model 
of mutation at each locus, Jenkins and Song |11 12] derived a closed-form formula for the 
first-order term qi and showed that its functional form depends on the assumed model of mu- 
tation only through marginal one-locus sampling probabilities, a property which they termed 
universality. Further, they showed that the second-order term (72 can be expressed as a sum of a 
closed-form formula plus another part that can be easily evaluated numerically using dynamic 
programming; they also showed that, for most sample configurations, the closed-form part 
of 52 dominates the part that needs to be computed numerically. More recently, the same 
authors |fT3) utilized the diffusion process dual to the coalescent with recombination to develop 
a new computational technique for computing qk for all k > 1. Moreover, they proved that 
only a finite number of terms in the asymptotic expansion is needed to recover (via the method 
of Pade approximants) the exact two-locus sampling probability as an analytic function of p 
for all p g [0, 00). An immediate application of this work would be the composite-likelihood 
method |fT0][T7][T8J for estimating fine-scale recombination rates which is based on combining 
two-locus sampUng probabilities. 

The main goal of this paper is to extend some of the mathematical results described above 
to more than two loci. More precisely, we derive closed-form formulas for the first two terms 
{qo and qi) in an asymptotic expansion (described later in detail) of the sampling distribution 
for an arbitrary number of loci. In general, the number of possible allelic combinations grows 
exponentially with the number of loci, and the system of equations that we need to solve is 
considerably more complex than that in the case of two loci. Note that the details of the 
computational techniques developed in pT|[T2 | are specific to the case of two loci, and new 
methods need to be developed to handle an arbitrary number of loci. In this paper, we employ 
combinatorial approaches to make progress on the general case. Our work shows that the 
universality property of go and Qi previously observed 1 1 1 T2j in the two-locus case also 
applies to the case of an arbitrary number of loci. 

The remainder of this paper is organized as follows. In Section |2] we introduce the multi- 
locus model to be considered in this paper and describe our notational convention. Our main 
results are summarized in Section [3] and an explicit example involving three loci is discussed 
in Section HI In Section l5] we provide proofs of the main theoretical results presented in this 
paper. 

2 Preliminaries 
Below we describe the model considered in this paper and lay out notation. 
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Figure 1 : Illustration of L loci arranged linearly. The population-scaled recombination rate between 
loci I and Z + 1 is pi/2. 

2.1 Model 

We consider the diffusion limit of a neutral haploid exchangeable model of random mating 
with constant population size 2N. The haploid individuals in the population are referred to 
as gametes, and each gamete contains L > 2 loci labeled 1,2, ... ,L and laid out linearly 
as illustrated in Figure [T] The probability of mutation at locus / per gamete per generation is 
denoted by ui , whereas the probability of recombination between loci I and I + 1 per gamete per 
generation is denoted by c/. In the diffusion limit, as A^ — >■ oo we let ui — > 0, for 1 < I < L, 
and ci — > 0, for I < I < L — 1, such that 4Nui — ?> 6i and ANci — >■ pi, where di and pi are 
population-scaled mutation and recombination rates, respectively. 

Given a positive integer k, we use [k] to denote the fc-set {!,..., k}. At locus /, we assume 
that there are Ki distinct possible allele types, labeled by [Ki]. Mutation events at locus I occur 
according to a Poisson process with rate 9i/2, and allelic changes are described by an ergodic 
Markov chain with transition matrix P^ ' — (P^i,)', i.e., when a mutation occurs to an allele 
a G [Ki], it mutates to allele b G [Ki] with probability P^J . The stationary distribution of P^ ' 
is given by tt^'' with the ath entry Tri . 

Recombination events between loci I and I + 1 occur at rate pi/2. In our work, we are 
interested in the case where pi ^ 1, for all I E [L — 1], and have similar order of magnitude. 
Specifically, we re-express the recombination rates as pi — rip, where ri are scaling constants, 
and consider an asymptotic expansion as p ^- oo. 

2.2 Notation 

As detailed later, the standard coalescent with recombination implies a closed system of 
recursion relations satisfied by sampling probabilities. To obtain such a closed system of 
recursions, the allelic type space must be extended to allow gametes to be unspecified at some 
loci. We use the symbol * to denote an unspecified allele and define the L-locus haplotype set 
Ti. as 

^ = (([i^i] u W) X • • • X ([i^^] u W)) \ {*n- 

Given a haplotype h £ T-L,we use hi £ [Ki] U {*} to denote the allelic state of h at locus I. In 
what follows, we introduce definitions that are used throughout the paper 
The following two definitions explain how we denote samples: 

Definition 1. (n and e^, sample configurations.) A sample configuration is denoted by n = 
{nh)he'H^ where Uh is the number of times haplotype h occurs in the sample, and the same 
letter n in non-boldface is used to denote the total sample size of n; i.e., n = J^hen "''■ ^^^ 
symbol e^ is used to denote a sample configuration of size 1 for which n/i = 1 and Uh' = 
for all h' y^ h. Note that we can write n — J^hen ''^hGh- 

Definition 2. (n"' and a{n), marginal sample configurations.) Let n = {nh)he'H be an L- 
locus sample configuration. For 1 < I < L, the marginal sample size for locus / and allele 
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a € [Ki] is defined as ni ~ llhen-h =a "•'i' ^'^'^ '■^^ marginal sample size for locus I is 
defined as n^'' = 'l2aelK 1 "" ^i-^-' '■^^ '■^'■^^ number of haplotypes with specified alleles at 
locus 0- Further, we use n^'-* — ('^a )ae[K,] to denote the iiT; -dimensional vector specifying 
the marginal sample configuration for locus I, and use cr(n) = {n'^^\ . . . , n*^^') to denote 
the I/-tuple of marginal sample configurations. Also note that if hi = *, then eJ^Ms a Ki- 
dimensional zero vector. 

The sets described in the following two definitions specify where mutations or recombina- 
tions can occur in a given haplotype: 

Definition 3. {S{h), specified loci.) For each locus /, the alleles labeled by [Ki] are called 
specified alleles. Further, given a haplotype h G H, we use S{h) C [L] to denote the set of 
loci at which h has specified alleles (i.e., not *). 

Definition 4. (B{h), break intervals.) When considering recombination, a given haplotype 
h £ H can be broken up between loci I and I + I if min(S'(ft.)) < I < I + 1 < niax{S{h)). 
The index I is used to refer to the break interval {1,1 + 1), and the set of valid break intervals 
for haplotype h is denoted by B{h) = {min(S'(/i)), . . . ,max{S{h)) — 1}. 

The two relations described below compare haplotypes. When two haplotypes satisfy either 
relation, then their corresponding lineages are allowed to coalesce. 

Definition 5. (X, compatibility.) Given a pair of haplotypes h, h' E H, if hi = h[ for all 

I G S{h) n S{h'), then we say that they are compatible and write h^h'. 

Definition 6. (y, containment.) Given a pair of haplotypes h, h' E %, we write h > h' \f 
S{h) D S{h') and hi = h[ for all I e S{h'). 

Corresponding to the types of event that may occur in the coalescent with recombination, 
we define the following operations on haplotypes: 

1. (Mutate): Given a locus I E [L] and an allele a E [Ki], define Mj^{h) as the haplotype 
derived from h E Hhy substituting the allele at locus I with a. 

2. (Coalesce): \fh)\h', define C(/i, h') as the haplotype h" constructed as follows: 

hi, if hi y^ * and /ij = *, 

K' ^ { K^ if hi = * and h[ =^ *, 

hi = h'l, otherwise. 

3. (Break): Given a break interval I E B{h), we use Rf{h) — {hi, . . . ,hi,*, . . . ,*) 
to denote the haplotype obtained from h by replacing hj with * for all j > I + 1, and 
Rl{h) — (*,... ,*,/i;+i, . . . ,hL) to denote the haplotype obtained from /i by replacing 
hj with * for all j < I. 

Given a haplotype h E H and a set X C [L], we define H{h,X) as the set of haplotypes that 
contain h and are specified at the loci in X; i.e., 

H{h, X) ^ {hf E n I S{h') D X and h' ^ h}. (2) 

Lastly, for a given subset X C [L], we define r{X) as 

r{X) = r^in(X) + rmin(X)+l H ^ r^ax{X)-l, 
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which corresponds to the total recombination rate (relative to p) between the first and the last 
loci in X. 



3 Main results on multi-locus asymptotic sampling distributions 

For ease of notation, in most cases we suppress the dependence on the parameters {ri}i~^ 
and {di, P^ l/^Li when writing sampling probabilities. By exchangeability, the probability of 
any ordered configuration corresponding to sample n is invariant under all permutations of the 
sampling order. Hence, we use q{n) without ambiguity to denote the stationary sampling 
probability of any particular ordered configuration consistent with n. From the standard 
coalescent with recombination ||2]|5]|6][8]|9), one can derive a closed system of recursions 
and boundary conditions for which q{ri) is the unique solution. Specifically, q{n) satisfies the 
following system of linear equations: 



hev. 



Y^rth {n-l)+ Y. ^1+ Y. Pi 
les(h) leB(h) ■ 



q{n) 



: ^ n,, (n,i - l)g(n - e,j) + ^ nh'q{n - et - Sh, + ec{h,h')) 

hen L h' eU: h Kh' Ji^h' 

l£S{li) ae[Ki] 



leB(h) 



with boundary conditions 



q{eh)= n 4?'forall/ieH. 



(3) 



(4) 



l£S(h) 



We define q{n) —Qifrih < for any h E H. 

The above closed system of equations is a full-rank linear system in the variables q{m) for 
all samples m reachable from the given sample n through repeated application of ([3]|. Since 
Pi — Tip, the entries of the matrix associated with the linear system are linear in p. Hence, the 
entries in the inverse matrix are rational functions of p, thus implying that q{n) is a rational 
function of p, say f{p)/g{p) where / and g are polynomials that depend on n and rj. Also, 
for every sample configuration n, since < ^(n) < 1 as p -> cx), / and g must be of the 
same degree in p. Hence, it follows that q{n) is also a rational function of p^^ with both the 
numerator and the denominator having non-zero constant terms. Hence, the Taylor series of 
q{n) about p = oo gives the following asymptotic expansion in inverse powers of p: 



qin) =go(n) + 



qi{n) q2{n) 



O 



1 



(5) 



where the coefficients qi^{n) , qi{n) , 92 ('".), . . . , are uniquely determined, and they depend on 



the sample configuration n and the model parameters {0;, P^ }/Li ^nd {ri}i^^ , but not on 
p. Note that qQ{n) corresponds to the sampling probability when p is infinitely large, in which 
case all haplotypes instantly break up into one-locus fragments and evolve independently back 
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in time. Hence, as proved by Ethier [1] in the case of two loci, we expect qq (n) to be given by 
the product of marginal one-locus sampling probabilities. The following result formalizes this 
intuition: 

Proposition 1. For all L-locus sample configurations n, 

L 



90 



in)=l[pin<^'Uei,P^'y), 



(6) 



1=1 



where p denotes the marginal one-locus sampling distribution. 



Remark: An exact, closed-form expression for the one-locus sampling distribution p{rv-^'^ \ 
6i,P^') is not known for general finite-alleles mutation models. However, if a finite-alleles 
parent-independent mutation model is assumed at each locus (i.e., for each locus I E [L], the 
mutation transition matrix satisfies P^fl — tt^ for all a,b € [Ki]), then in (rol) one can use 
Wright's p2| one-locus sampling formula 






where (x)„ — x{x + 1) • • • (a: + n — 1). 

A proof of Proposition [T] is provided in Section |5.1| Since go {n) depends only 
marginal sample configuration a{n) = (n*^^-', . . .~JtJ^^), henceforth we use qa{; 
go(o'(n)) interchangeably. 

In SectionB^ we apply the inclusion-exclusion principle to derive the following key result: 



on the 

qviin) and 



Proposition 2. The qi{n) term in the asymptotic expansion (|5]l ofq(n) is the unique solution 
to the recursion 



hen leB(h) 

= E 



qi{n) ~qi{n-eh + e^-^^, + e^+(^)) 
qo{<j{n) -cr{eh)) 



E ("D 



\X-S{h)\ 



X:\X\>2, 
S(h)CXC[L] 



E "'»' 

\h'eH{h,X) 



J2 "Z^" 

\h"eH(h,X) 



with boundary conditions 



qi{eh) = 0, for all h eH. 



(7) 



(8) 



,(0 



We define qa{<T{n)) = ifria < for any I e [L] and a e [Ki]. 



In Section 5.3 we prove that the closed-form expression for gi (n) in the following theorem is 
the unique solution to (jTll and (IHll: 
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Theorem 1. Recursion ([7]) and boundary conditions ([8]) admit the following unique solution 
forqi{n): 



11 



(") = XI qo{(j{n) - a{eh)) ^ 



(_1)|X-5(.)| /^^,^^ 



hen\j{*^} 



X:\X\>2, 
S(li)CXC[L] 



iX) 



'eH{h,X) "^h' 

2 



, (9) 



where qq is given by a product of marginal one-locus sampling distributions as described in 
PropositionU] 

The intuition behind Proposition Inland TheoremfTlis as follows: In 1 12|, a formula for qi{n) 
in the two-locus case was obtained by deriving a recursion satisfied by qi (n) and by solving it 
using a probabilistic interpretation based on multivariate hypergeometric distributions. The 
correct multi-locus generalization of the two-locus recursion for qi{n) used in |12| turns 
out to be the inclusion-exclusion type expression shown in Proposition l2] and an appropriate 
generalization of the associated probabilistic interpretation is based on Wallenius' noncentral 



5.3| we provide a 
4] that the general 



hypergeometric distributions. For ease of exposition, however, in Section 
purely combinatorial proof of Theorem [T] For two loci, we show in Section 
multi-locus solution for qi (n) in (J9]) reduces to the solution found in [ 12 1. 

In summary, Propositionflland TheoremfTlimply the following asymptotic expansion of the 
L-locus sampling distribution: 

Corollary 1. For an arbitrary L-locus sample configuration n, the sampling probability q{n) 
in the limit p — > oo has the following asymptotic expansion: 



q{n) ^\{p{n^')) 



1=1 



E 



n^ 



(I) ^J'h 



n^ ' — e 



O 



heHu{*^} u=i 
1 






X:\X\>2, 
S(h)CXClL] 



r{X) 



where p{'n^'' ) denotes the marginal one-locus sampling probability for locus I with parameters 
OiandP'-^K 

Note that the formulas for qo (n) and qi (n) shown in Proposition [T| and Theorem W\ re- 
spectively, do not have any explicit dependence on the mutation parameters. More precisely, 
the dependence on the assumed mutation model arises only implicitly through the one-locus 
sampling probabilities ^(71^''), and the formulas in Proposition 111 and Theorem 111 apply to 
all finite-alleles mutation models. In fact, by carrying out a similar line of derivation as that 
presented in this paper, it can be shown that the formulas in Propositionflland Theorem [T] also 
apply to the case of the infinite-alleles model of mutation at each locus; the marginal one- 
locus sampling probabilities p{n'--'^ ) in that case are given by the Ewens sampling formula P). 
Jenkins and Song 1 1 1 1 observed this universality property of go and qi earlier in the case of 
two loci. Our results imply that the universality property extends to an arbitrary number L of 
loci. 
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4 An explicit example: the three-locus case 

Below we provide an explicit formula for qi (n) in the case of L = 3. For ease of notation, 
we adopt the convention that the indices i,j, and k denote specified alleles which range over 
[Ki] , [K2] , and [^3], respectively. Hence, riijk denotes the number of fully specified haplotype 
{i,j, k). As in the rest of this paper, the symbol "*" represents an unspecified allele. Finally, 
the symbol "•" for the index corresponding to locus / represents a summation over all the 
alleles in [Ki], while the symbol "•" denotes a summation over [Ki] U {*}. For example, 
n^*. = Yjk(^[K3\ ""fc ™d ^*'- = SjG[X2] SfeGl-ftTa] "ufc' whereas rii,. = ni^. + n^... In this 
notation, TheoremfTlimplies that 91 (n) for L = 3 is given by 



qi{n) =qn{(j{n)) 


■ 1 


fn..,\ 1 
V 2 J ^ ^2 


fn,..\ 1 /n.,.\ 


1 


V2 


Y 
y. 




V 2 / r-i + r2 V 2 / ri + rj 




i 


-o-(e„*)) 


1 /^n,..\ 1 /n, 




2 J. 


Ti +^2 V 2 / ri + r2 V 2 




- cr{e^j*)) 


1 /n.,.\ 1 /n.j,\ 1 /n.,. 
.ri+^2 V 2 y ri V 2 y r2\ 2 


)] 




k 


- CT(e**fc)) 


1 /n..fe\ 1 /n 


..A 1 

2 y r2 


/"■•■/c^ 


Ti +7-2 V 2 / ri +r2 V 


V 2 J 




-o-(e.y*)) 


1 /ny.\ 1 /n,,.\l 






r-i V 2 y ri+r2 V 2 y 






- o-(e*jfe)) 


' 1 fn'jk\ 1 /'^jfcA 






T2V 2 y ri+r2V 2 y 




+ ) '9o(o-(n; 

i.k 


- cr(e„A;)) 


1 fn,,k\ 1 /n 


;')] 






.J"! +^2 V 2 y ri +r2 V 




hJ,k 


T{n 


)-<y{e^Jk))- 


1 /"iifc\ 

•1 + r2 V 2 y ' 











(10) 



where go is given by a product of marginal one-locus sampling probabilities. If the sample 
does not contain any haplotype with an unspecified allele *, ( [T0| reduces to the following: 



9iH = (— + —] '?o(c^("))f "2 



1 1 

\ 

ri ^2 ri + r2 



V" '7o(o-(") - cr(ei**)) 

ri ^-^ 

i 

V go(fT(n) - (T(e„fc)) 

ro -"^ — ' 

:^ ) y" qoicrin) - a(e,jfe)) ( '^„^'' ) 

'2 ?-i + ?'2 y ^ ' V 2 y 



n..k 
2 



2 
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If the second locus is ignored, or equivalently, if every haplotype in the sample has an unspec- 
ified allele * at the second locus, ( [TO] l becomes 



1 



qi[n) 



ri +r2 



9o(CT(n)) 



n.*. 



Yqo{<T(n) -cr(ei„)) 



Y^ (?o(cr(n) - cr(e,*fe)) y*''] 



^qo(a'(n) - (j{ei^k)) 



t^i^k 



which coincides with the formula found by Jenkins and Song pT|[T2) in the case of L = 2 



5 Proofs of main results 

In this section, we provide proofs of the results described in Section [3] For a given locus 
I G [L] and an allele a G [Ki], we use it^ to denote the i^T; -dimensional unit vector where the 
jth component is 1 if j = a and otherwise. 

5. 1 Proof of Proposition [T] 

By substituting the asymptotic expansion (|5]l into recursion ([3]l, dividing by p, and letting 
p — >■ oo, we obtain the following recursion for qo{n): 



'-h€'H iGB(h) -' hen lGB{li) 

We first estabUsh the following lemma: 



^B-W 



-^B^(h)) 



(11) 



Lemma 1. For every L-locus sample configuration n, qo(n) depends only on the marginal 



sample configurations CF{n) = {n 



= (nW 



,n 



(L) 



)•• 



qoin) ^qo{a{n)), 



(12) 



where (T{n) is viewed as a sample configuration containing X^/ip-w ''^h\S{h)\ haplotypes each 
with a specified allele at exactly one locus and unspecified alleles elsewhere. 

Proof. We use induction on the number of recombination events needed to transform a 
given sample configuration n into the configuration a{n) that contains J^hen "■'i|'5^(''-)l hap- 
lotypes each specified at exactly one locus. The base case corresponds to a sample n consisting 
of haplotypes each specified at only one locus, in which case n — (T{n), and (fT2| is trivially 
true. Given a sample configuration n, consider the right hand side of ( 11 1. For any haplotype 
h E H satisfying Uh > and any I g B{h), let m — n — eh + e^-,,^ + ej^+f^)- ^^^ 
sample configuration m needs one less recombination event to be transformed to <T{m) than 
n needs to be transformed to a{n). Hence, applying the induction hypothesis to m, we have 
qo{m) = qo{a{m)). Noting that a{m) = cr(n) and using (fTTji, we have 



hen ieB{h) 



9o(") = X! "-'* X! '''9o(CT(n)), 
hen ieB{h) 



10 
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which simplifies to ( 12 1 



D 



Now, let m denote a sample configuration such that ruh = for all h with more than one 
specified locus (i.e., |5(/i)| > 1). Substituting the asymptotic expansion Q f or n = m into 
(|3]l, using ( 11 1 to simplify, letting p — > cx), and utilizing LemmafT] we obtain the recursion 



E 

1=1 



m(')(m(')-l) + 6';m(') 



qo{m 



(1) 



.,m(^)) 



1 = 1 a(i[Ki] 

+ il(^i E PiW'?o(mW,...,m('-i),mW-«i + ni,m('+i),...,m(^)), 

1=1 a,be[Ki] 

(13) 
and boundary conditions 

go (0(1), . . . , 0('-i),«i,, 0('+i), . . . ,0(^)) = 4',), for all I G [L] and a, G [if,], (14) 

where o'^-* denotes the ii'j -dimensional zero vector. Notice that recursion ( pj] ) is the sum of L 
one-locus recursions of the form 



m(')(m(')^l) + 6',m(') 



p(m('))= Y. m«(mW-lMm«-0 

[Ki] 



a£[KA 



a,b£[Ki 



while boundary conditions (|14 1 are a product of one-locus boundary conditions p(Mq) = ttI 
and p(0(')) = 1 for all Z G [i and a G [Ki\. Hence, it follows that qo{m^^\ ... , m'^^^) = 
rii^iPl'"^*''^)- Finally, together with Lemmalll letting m^'^ — n*^') for all 1 < ^ < L in the 
above result implies (7o (''T') = <1o{'>T'^^\ . ■ . ,n^^^) = nz=iP('^ *''')■ '-' 

5.2 Proof of Proposition |2] 

By an induction argument similar to that in the proof of LemmafT] one can see that recursion 
(J7]i and boundary conditions ([8]) have a unique solution. Substituting (|5]l into both sides of (|3]l, 
using (111, and taking the limit as p — > oo, we obtain 



^ n/j (71-1)+ ^ 01 qn{n)+Ynh ^ riqi{n) 
h^H '- ies(h) -' hen ieB{h) 

= E ""'* ("/i - l)9o(" - e;0 + E Uh'qoin- Eh- eh> +ec(h. 

h^H '- h' eU: h Kh\h^h' 

leS{h) ae[Ki] 
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leB(fi) 



(15) 



The terms that depend on mutation parameters can be eliminated by setting m — a{n) in ( |T3] l, 
subtracting it from ( [T5| l, and using the property of qo{n) that it depends only on the marginal 
sample configuration at each locus. As a consequence, the following simpler recursion can be 
obtained: 



J2nh J2 ^' 
hen ieB{h) 



qi{n) -qi{n-eh + Bj^-^,^^ + e^+(^)) 



y^ nhnh'qo{n- Eh- Bh' +ec{h,fi')) 

ri,h'e'H: hXh',fi^fi' 

L 

+ "^ nh{nh - l)qo{n - Bh) - n{n - 1) - "^ n'^'^^n'^^^ - 1) qo(^ 



hen 

L 



1=1 



e-i) r,.(0 - „.' n('+i) 



i = l ae[A',] 



n(^)). (16) 



We also have the boundary conditions qi{Bh) = for all h E H since q{Bh) = qo{Bh). 

As the left hand side and boundary conditions of ( 16 1 are identical to that of (JTl), it suffices 
to establish that their right hand sides are also identical to show equivalence. Note that the 
right hand side of (|7]i can be written as follows: 



Y: qoiain) - a{B,)) Yl (-l)'^-"''^)' 
tienul*^} X:\x\>2, \h'eH{fi.x) 



X:\X\>2, 
S(h)CXClL] 



E "''' E "'^" - 1 

h'eH(li.X) / \h"eH(h,X) / 



E qa{cr{n) ~ a{Bh)) 
heHu{*^} 



E (-1) 



\X-S(ti)\ 



. X:\X\>2, 
S{li)CXClL] 



E nh'Uhn 

^h'M"eH{h,X) 



E qo{<jin) - aiBh)) 
h.enu{*^} 



E (-1) 



\X-S{h)\ 



. X:\X\>2, 
S(h)CXC[L] 



E "''' 
\h'eH(h,X) 



(17) 



The first term on the right hand side of ( [T7| can be rewritten as 



E 9o(CT(n) - (j{Bh)) 

?iG-HU{*^} 



E (-1) 



|x-s(/oi 



L X:\X\>2, 
S{h)<ZX(Z[L\ 



\h',h"eH{}i,X) 



= E go(o-(n)-(T(e,0) 
hewu{*i} 



.h',h"yii \ X:\X\>2, / . 



E '?o(o-(«-) -o-(eh))'| E ^'^h'nh" 
henui*^} 



h'h"yii 



X:\X\>2 
S{h)CXC{S{h')nS{fi")) 

y^ l^_^yx-sih)\ 

.X:S{h)CXC{S{h')nS{h")) 
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E 



^_^-^\X-Sih)\ 



y^ ^_^yx-sih)\ 



X:\X\=0, 
Sih)CXC{S{h')nS{h")) 



X:\X\=1, 
S{h)CXC(S{h')nS(h")) 



(18) 



where the first equality follows because, by definition (J2]), the condition that h' G H{h, X) is 
equivalent to h' ^ h and X C S{h'), and similarly for h". Now, by the inclusion-exclusion 
principle, 

z2 (-i)!-^-^^)! ^ Ss{h'}ns{h"},sih), 

X:S{h)CXC{S(h')nS{h")) 

where for any sets A and B, 5a,b = 1 if A = _B, and Sa,b = otherwise. Then the second to 



last line of (1 8 1 simplifies to 



E 9o(o-(n) - (T{eh)) 



E 



nh'nh"i ^ (-1) 

.h',h"yh \x:S{h)<ZX<Z{S(h')nS{h")) 



\X-S(h)\ 



E 9o(CT(n)-cr(e,0) 



/ , nh'nh"Ss(h')ns{h"),sih) 

Lh',h"yh 

h' .h" -.h' X h" 

= ^ rih>nh>'qo{n - e^, - eh» + ec{h',h")) 

h' .h" :h' X h" 

= ^ nh'nh"qQ{n-eh, -Eh" +ec(h',h"))+^nlqQ{n-eh), 

h'.h":h' X_h"M'^h" heU 

where the second equality follows because S{h) = S{h') n S{h") and h' , h" > h imply that 
h' and h" are compatible by definition, and hence cr{eh) — <j{eii') + cri^h") — '^{^c(h' ,h"))- 
The third equality holds because q^ depends only on the marginal sample configurations, and 
the last equality follows because when h' = h" = h, C{h' , h") = h. The terms in the last line 



of ( 18 1 can be simplified as follows: 



E qo{cr{n)-cr{eh)) 



E "''' 



nh" 



y^ ^_^-^\x-s{h)\ 



.h',h"yh 



X:\X\=0, 
S{h)<ZXG{S{h')nS{h")) 



n^qoin), 



since the only h, h' , h" and X that satisfy the conditions of the inner summation over X are 
h — {*^}, h' , h" e "H and X — 0. Furthermore, one can also see that 



E qoicrin) ^ a{eh)) 



E -^-^^"( E (-i)i^-^(")i 

h',h"yh \ X:\X\ = 1, / 

S{h)CXC{S{h')nS{h")) 

1 = 1 aelK,] 1=1 



since the only h, h' , h" and X that satisfy the conditions of the summation over X are: 



Multi-locus asymptotic sampling distributions 



13 



• X = {1} for some I G [L] and h = {*^}. Because of the condition that X C {S{h') n 
S{h")), h' and h" range over all haplotypes that are specified at locus /. 

• X — {1} for some I e [L] and h such that hi — a for some a G [iiT/] and hi' = * for 
all /' ^ /. Because h' , h" > h wA X (Z {S{h') f] S{h")), h' and h" range over all 
haplotypes with allele a at locus /. 



In summary, ( 18 1, which corresponds to the first term on the right hand side of (17 1, can be 
written as 



^ qQ{a{n) - a{eh)) 



E (-1) 



\X-S{h)\ 



. X:\X\>2, 
S(h)(ZX(Z[L\ 



\h'Ji"eH{h,X) 



y^ nh'nh"qo{n- eh, +eh„ - ec(ii',h")) + ^^n\qQ{n~ Eh) 

h.\li":h' }yh.",h'^h." heti 

1=1 
-E E (4'^r9o(n(^\...,n('-^\n«-«i,n('+i),...,nW). (19) 

1=1 a(^[Ki] 



By following similar steps as above, the second term on the right hand side of (17i can be 
written as 



E qo{(T{n) - a{eh)) 



E (-i: 



\X-Sili)\ 



■ X:\X\>2, 
Sih)CXC[L] 



E "'«' 

\h'eH(h,X) 



= E qa{(j{n) - a{eh)) 
h.e«u{*-^} 



E"-f E M) 

/I'^/i \X:S(/i)CXCS(/i') 



|X-S(ft)| 



E (-1) 



|X-S(/i)| 



E 



(-1) 



\X-S(h)\ 



X:\X\=Q, 
S{h)<ZX<ZS{ri') 



X:\X\ = 1, 
S(li)CXC{S{li')nS{h")) 



E qo{(Tin) - a{eh)) 






E (-1) 



|X-S(/i)| 



E 



(-1) 



\X-S{h)\ 



X:\X\=0, 
S{h)<ZX<ZS(h') 



X:\X\ = 1, 
S(h)CXC{S{h')nS{fi")) 



y^ nhqo{n - e^) - nqo(n) + ^ n^'^(7o(") 



/tew 



(0„„i „(i+l) n^L)^ 



1=1 ae[Ki] 



(20) 
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where the second equality follows from the inclusion-exclusion principle. Finally, subtracting 
( [20) 1 from ( [T9] ), we see that the right hand side of (JTl) is equal to the right hand side of ( 16 1. □ 

5.3 Proof of TheoremU] 

We first show that the boundary conditions ([8]l are satisfied by (|9|: If n = e^ for some 
g E H, then in the right hand side of (|9]l, the only h that can potentially contribute to the 
summation are those satisfying g >: h, since otherwise qQ{a{eg) — cr(e^)) ~ 0. However, 
for g >z h, if S{h) C X C S(g), then J2h"eH(h x) ^h" — 1 = since g e H{h, X), and if 
X % S{g), then J^i 'pHih x) "'i' ^ ^ since g ^ H{h, X). Therefore, in the right hand side of 



l^h'£H{h,X) "■'!■' 

2 



\h'eH(h,X) / \h"eH(h.X) / 



and so qi (cg) =0 for all 5 G "H. 

We now show that the recursion ^ is satisfied by Q. Substituting Q into the left hand 
side of (|7|, we obtain 



E"9 E '^^ 

gen ieB{g) 



qi{n) -qi(n- Bg + e 



Rlig) +^fl,+ (9)) 



= E"9 E '^^ E 9o(c^(w)-cr(e,,))<j ^ 



X:|X|>2, 

s(;i)cxc[L] 



(■_2)I'^-S'WI 



]{ E Wr 



E "/^ 

lZh'eH(h.X) ^h'M-(g) + "h',R+[g) ~ °'i' 



(g)+^/i",i?+(g) -'^'»",9 



where for any haplotypes g and /i, (5g.h = lif g = h and (5g_/, = otherwise. Also, in the 
previous line, by (2) we mean x{x — l)/2 for all a; e M. Note that if g ^ H{h,X), then 
R-ig),R+{g)iH{h,X),eindso 



E ^h',B-ig)+h'Mt{g)-^h' 



0. 



h'eH(h,X) 



Interchanging the summation over g, I and X, and introducing the restriction that g G H{h, X) 
(i.e., S{g) I) X and 5 )^ h), we obtain 



E"9 E '^' 



qi(n) -9i(n-eg + e 



R-(g)+^R+(g)) 



E '?o(CT(n) - a(eft)) <^ ^ -— ^ "s E ^' 



/ie«u{*-f-} 



Js::|X|>2, 

s(/i)cjs:c[L] 



r(X) 



geH(h,X) leB{g) 



E "'»' 
yh'eH{h,X) 



E 

.h"<£H(h,X) 



V.fl-(3)+V'.fl+(g)-^'^",9 
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Figure 2: Illustration of sub-cases considered in the proof of Theorem [T] Here X C S{g), where 
g >Z h. Squares denote the loci in S{g), while shaded squares denote the loci in X (and hence also in 
S{g)). Circles denote the loci not in S{g) (and hence not in X). A squiggle denotes the recombination 
break interval I considered in each case. The s quar es to the left and to the ri ght o f the squiggle respectively 
denote the loci in S{R^{g)) and S{R^{g)). (a) Case with I < min(>'f). (b) Case with min(X) < / < 
max(X). 



l^h'eH{h,X) "h',R^(g) 



'h',R+{g) 



^ ^h',c 



(21) 



Now, for g e H(h, X), note that 



h'eH{h,X) 



(22) 



We utilize this identity in the ensuing discussion. There are three cases for the recombination 



break interval I G B{g) in the right hand side of (21 1: 



1. I < min(X): This case is illustrated in Figure 2(a) 



Note that J2h'eH{fi,x) K\Ri(g) == since S{Ri {g)) n X = and hence i?, [g) i 
H{h,X). Also, Eh'eHiKX) K'Mt[g) = 1 since g > h and 5(i?+(.g)) D X D S{h), 
and so R^{g) >: h and R^ig) G H{h,X). Hence, together with (22 1, we conclude 



Z-^h'eH(h,X) °h',R.^{g) 



+ ^h'Ji+(g) 



^h',g 



0. 



2. max(X) < I. By a similar argument, we see that J^h'eH 

Sh'.n = 0. 



ifi.X)"h',R-{g)^"h'M+{g)- 



3. min(X) < Z < max(X): This case is illustrated in Figure 2(b) 



Nolelhal J2h'eH(h,X)Sh',R-(g) 



since max(X) ^ S{Ri {g)) and hence R^ {g) ^ 



{h,X)"h',R+ig) 



— since min(X) ^ S{R'^{g)) and hence 



H{h,X). Similarly, J:,^,^^ 

R'lig) ^ H{h, X). Therefore, upon using (22 1, we conclude J2h'eH(h x) ^h' Rr (q) + 



^h'Mt(g) 



Sh' 



-1. 



Partitioning the summation over I G B{g) in the right hand side of expression (21 1 into 
the above three cases and noting that only the third case gives a non-zero value for the term 

T,h'eH(h^x) k',R-{g) + ^h'M+(g) " ^h' ,g, we obtain 



/ , '"g /_^ 
gen ieB{g) 



ri 



qi{n) -qiin-Sg 



^R-(g) +^Rt{g)' 
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J2 qoicr{n) ~ a{eh)) ^ 
/ie-Hu{*^} 



(^_iyx-s{h)\ 

inax(X) — 1 
geH(h,X) l=-min{X) 



X:\X\>2, 
S{h)<ZX(Z[L] 




E 



hev.u{*^} 



qoia{n) - a{eh)) 



E 

X:\X\>2, 
S(h)CXC[L] 



(-1) 



\X-S{h)\ 



E 



.h'eH{h,X) 




J2 '^h" 

h"eH{h,X) 



This is the expression on the right hand side of (|7]), and thus we have shown that (J9]l satisfies (|7]l. 
Hence, the proposed solution for qi{n) in Theorem [I] is the unique solution to the recursion 
(|7]l and boundary conditions (ISll. □ 

Acknowledgments 

We thank Paul Jenkins, Jack Kamm, and Matthias Steinriicken for useful discussion and 
comments, and an anonymous reviewer for helpful comments. This research is supported in 
part by an NIH grant R01-GM094402, an Alfred P. Sloan Research Fellowship, and a Packard 
Fellowship for Science and Engineering. 

References 

[1] Ethier, S. N. (1979). A limit theorem for two-locus diffusion models in population 
genetics. Journal of Applied Probability 16, 402^08. 

[2] Ethier, S. N. and Griffiths, R. C. (1990). On the two-locus samphng distribution. 
Journal of Mathematical Biology 29, 131-159. 

[3] EWENS, W. J. (1972). The sampling theory of selectively neutral alleles. Theoretical 
Population Biology 3, 87-1 12. 

[4] Fearnhead, p. and Donnelly, P. (2001). Estimating recombination rates from 
population genetic data. Genetics 159, 1299-1318. 

[5] GOLDING, G. B. (1984). The sampling distribution of Unkage disequilibrium. Genetics 
108, 257-274. 

[6] Griffiths, R. C. (1981). Neutral two-locus multiple allele models with recombination. 
Theoretical Population Biology 19, 169-186. 

[7] Griffiths, R. C, Jenkins, P. A. and Song, Y. S. (2008). Importance samphng 
and the two-locus model with subdivided population structure. Advances in Applied 
Probability 40, 473-500. 

[8] Griffiths, R. C. and Marjoram, P. (1996). Ancestral inference from samples of 
DNA sequences with recombination. Journal of Computational Biology 3, 479-502. 



Multi-locus asymptotic sampling distributions 17 

[9] Hudson, R. R. (1985). The sampling distribution of linkage disequilibrium under an 
infinite allele model without selection. Genetics 109, 611-631. 

[10] Hudson, R. R. (2001). Two-locus sampling distributions and their application. Genetics 
159, 1805-1817. 

[11] Jenkins, P. A. and Song, Y. S. (2009). Closed-form two-locus sampHng distribu- 
tions: accuracy and universality. Genetics 183, 1087-1103. 

[12] Jenkins, P. A. and Song, Y. S. (2010). An asymptotic sampling formula for 
the coalescent with recombination. Annals of Applied Probability 20, 1005-1028. 
(Technical Report 775, Department of Statistics, University of California, Berkeley, 
2009) http://arxiv.org/abs/1010.31 12vl. 

[13] Jenkins, P. A. and Song, Y. S. (2011). Pade approximants and exact two- 
locus sampling distributions. Annals of Applied Probability, in press. (Techni- 
cal Report 793, Department of Statistics, University of California, Berkeley, 2010) 
http://arxiv.org/abs/1107.3897vl. 

[14] Kingman, J. F. C. (1982). The coalescent. Stochastic Processes and Their Applications 
13, 235-248. 

[15] Kingman, J. F. C. (1982). On the genealogy of large populations. Journal of Applied 
Probability 19, 27-^3. 

[16] KuHNER, M. K., Yamato, J. and Felsenstein, J. (2000). Maximum likelihood 
estimation of recombination rates from population data. Genetics 156, 1393-1401. 

[17] McVean, G., Awadalla, P. and Fearnhead, P. (2002). A coalescent-based 
method for detecting and estimating recombination from gene sequences. Genetics 160, 

1231-1241. 

[18] McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R. and 
Donnelly, P. (2004). The fine-scale structure of recombination rate variation in the 
human genome. Science 304, 581-584. 

[19] Nielsen, R. (2000). Estimation of population parameters and recombination rates from 
single nucleotide polymorphisms. Genetics 154, 931-942. 

[20] Stephens, M. and Donnelly, P. (2000). Inference in molecular population genetics. 
Journal of the Royal Statistical Society: Series B 62, 605-655. 

[21] Wang, Y. and Rannala, B. (2008). Bayesian inference of fine-scale recombination 
rates using population genomic data. Philosophical Transactions of the Royal Society B 
363, 3921-3930. 

[22] Wright, S. (1949). Adaptation and selection. In Genetics, Paleontology and Evolution. 
ed. G. L. Jepson, E. Mayr, and G. G. Simpson. Princeton University Press pp. 365-389. 



